Smart Paper Notes

Interpretable ML for Imbalanced Data

Damien A. Dablain , Colin Bellinger , Bartosz Krawczyk , David W. Aha , Nitesh V. Chawla

Category: Machine Learning

2022-12-15

Deep learning models are being increasingly applied to imbalanced data in high stakes fields such as medicine, autonomous driving, and intelligence analysis. Imbalanced data compounds the black-box nature of deep networks because the relationships between classes may be highly skewed and unclear. This can reduce trust by model users and hamper the progress of developers of imbalanced learning algorithms. Existing methods that investigate imbalanced data complexity are geared toward binary classification, shallow learning models and low dimensional data. In addition, current eXplainable Artificial Intelligence (XAI) techniques mainly focus on converting opaque deep learning models into simpler models (e.g., decision trees) or mapping predictions for specific instances to inputs, instead of examining global data properties and complexities. Therefore, there is a need for a framework that is tailored to modern deep networks, that incorporates large, high dimensional, multi-class datasets, and uncovers data complexities commonly found in imbalanced data (e.g., class overlap, sub-concepts, and outlier instances). We propose a set of techniques that can be used by both deep learning model users to identify, visualize and understand class prototypes, sub-concepts and outlier instances; and by imbalanced learning algorithm developers to detect features and class exemplars that are key to model performance. Our framework also identifies instances that reside on the border of class decision boundaries, which can carry highly discriminative information. Unlike many existing XAI techniques which map model decisions to gray-scale pixel locations, we use saliency through back-propagation to identify and aggregate image color bands across entire classes. Our framework is publicly available at https://github.com/dd1github/XAI_for_Imbalanced_Learning
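
The class-level aggregation of color bands described in the abstract can be illustrated with a small sketch. It assumes that per-image saliency maps (for example, absolute input gradients obtained by back-propagation) have already been computed and stored as an array of shape (N, C, H, W). The function name and array layout are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def class_band_saliency(saliency, labels, num_classes):
    """Share of total saliency that falls in each color band, per class.

    saliency: (N, C, H, W) non-negative saliency maps (assumed precomputed).
    labels:   (N,) integer class labels.
    Returns a (num_classes, C) array; each non-empty row sums to 1.
    """
    saliency = np.asarray(saliency, dtype=float)
    labels = np.asarray(labels)
    per_image = saliency.sum(axis=(2, 3))  # (N, C): band totals per image
    out = np.zeros((num_classes, saliency.shape[1]))
    for c in range(num_classes):
        total = per_image[labels == c].sum(axis=0)  # aggregate over the class
        if total.sum() > 0:
            out[c] = total / total.sum()
    return out
```

A row such as `[0.6, 0.3, 0.1]` would indicate that, under these assumptions, most of the saliency for that class sits in the first band (red, for RGB input).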

translated by Google Translate

Towards A Holistic View of Bias in Machine Learning: Bridging Algorithmic Fairness and Imbalanced Learning

Damien Dablain , Bartosz Krawczyk , Nitesh Chawla

Category: Machine Learning

2022-07-13

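Machine learning (ML) plays an increasingly important role in rendering decisions that affect a broad range of groups in society. ML models inform decisions in criminal justice, the extension of credit in banking, and the hiring practices of corporations. This creates a requirement for model fairness, which holds that automated decisions should be equitable with respect to protected features (e.g., gender, race, or age), which are often under-represented in the data. We postulate that this under-representation problem is a corollary of the imbalanced data learning problem. Such imbalance is often reflected in both classes and protected features. For example, one class (those receiving credit) may be over-represented relative to another class (those not receiving credit), and a particular group (females) may be under-represented relative to another group (males). A key element of algorithmic fairness with respect to protected groups is the simultaneous reduction of class imbalance and protected-group imbalance in the underlying training data, which facilitates improvements in both model accuracy and fairness. We discuss the importance of bridging imbalanced learning and group fairness by showing how key concepts in these fields overlap and complement each other, and we propose a novel oversampling algorithm, Fair Oversampling, that addresses both skewed class distributions and protected features. Our method (i) can be used as an efficient pre-processing algorithm for standard ML algorithms to jointly address imbalance and group equity, and (ii) can be combined with fairness-aware learning algorithms to improve their robustness to varying levels of class imbalance. In addition, we take a step toward bridging the gap between fairness and imbalanced learning with a new measure, Fair Utility, that combines balanced accuracy with fairness.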

Efficient Augmentation for Imbalanced Deep Learning

Damien Dablain , Colin Bellinger , Bartosz Krawczyk , Nitesh Chawla

Category: Machine Learning

2022-07-13

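Deep learning models memorize training data, which hurts their ability to generalize to under-represented classes. We empirically study a convolutional neural network's internal representation of imbalanced image data and measure the generalization gap between the model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes. This insight enables us to design an efficient three-phase CNN training framework for imbalanced data. The framework involves training the network end-to-end on imbalanced data to learn accurate feature embeddings, performing data augmentation in the learned embedding space to balance the training distribution, and fine-tuning the classifier head on the balanced embedded training data. We propose Expansive Over-Sampling (EOS) as the data augmentation technique to use in the training framework. EOS forms synthetic training instances as convex combinations of minority class samples and their nearest enemies in order to reduce the generalization gap. The proposed framework improves accuracy over leading cost-sensitive and resampling methods commonly used in imbalanced learning. Moreover, it is more efficient than standard data pre-processing methods such as SMOTE and GAN-based oversampling, because it requires fewer parameters and less training time.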
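
The EOS step (convex combinations of minority samples and their nearest enemies) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function name, the uniform sampling of the mixing weight in [0, 1], the Euclidean distance, and the default `k` are all assumptions made for the example.

```python
import numpy as np

def eos_oversample(emb, labels, minority, n_new, k=5, seed=0):
    """Create n_new synthetic embeddings for the minority class.

    Each synthetic point is a convex combination of a minority sample and
    one of its k nearest 'enemies' (samples from any other class).
    """
    rng = np.random.default_rng(seed)
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    x_min = emb[labels == minority]
    x_enemy = emb[labels != minority]
    k = min(k, len(x_enemy))
    # Pairwise Euclidean distances from each minority sample to each enemy.
    dist = np.linalg.norm(x_min[:, None, :] - x_enemy[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]  # (n_min, k) enemy indices
    base = rng.integers(0, len(x_min), size=n_new)  # minority sample per new point
    enemy = nearest[base, rng.integers(0, k, size=n_new)]
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))  # assumed mixing weights
    synthetic = x_min[base] + lam * (x_enemy[enemy] - x_min[base])
    return synthetic, np.full(n_new, minority)
```

In the three-phase framework described above, `emb` would be the embeddings produced by the trained feature extractor, and the returned points would be appended to the training embeddings before the classifier head is fine-tuned.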

Mining Drifting Data Streams on a Budget: Combining Active Learning with Self-Labeling

Łukasz Korycki , Bartosz Krawczyk

Category: Machine Learning

2021-12-21

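Mining data streams poses a number of challenges, including the continuous and non-stationary nature of the data, the massive volume of information to be processed, and constraints on computational resources. Although a number of supervised solutions to this problem have been proposed in the literature, most of them assume that access to the ground truth (in the form of class labels) is unlimited and that such information can be used immediately when updating the learning system. This is far from realistic, because the underlying cost of acquiring labels must be considered. Solutions that reduce the ground-truth requirements in streaming scenarios are therefore needed. In this paper, we propose a novel framework for mining drifting data streams on a budget by combining information from active learning and self-labeling. We introduce several strategies that take advantage of both intelligent instance selection and semi-supervised procedures while taking into account the potential presence of concept drift. This hybrid approach allows efficient exploration and exploitation of streaming data structures within realistic labeling budgets. Because our framework works as a wrapper, it can be applied to different learning algorithms. An experimental study carried out on a diverse set of real-world data streams with various types of concept drift demonstrates the usefulness of the proposed strategies when access to class labels is highly limited. The presented hybrid approach is especially feasible when one cannot increase the labeling budget or replace an inefficient classifier. We provide a set of recommendations regarding the areas of applicability of our strategies.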
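
The combination of active learning and self-labeling under a budget can be sketched as a per-instance decision rule wrapped around an arbitrary incremental classifier. This is a simplified illustration, not one of the strategies from the paper. The thresholds, the `predict_proba`/`learn` interface, and the `oracle` callback are hypothetical, and concept drift handling is omitted.

```python
def label_decision(proba, spent, seen, budget=0.1,
                   query_below=0.6, self_label_above=0.95):
    """Decide what to do with one incoming instance.

    proba: class probabilities predicted by the wrapped model.
    spent: number of labels bought so far; seen: instances seen so far.
    Returns 'query' (ask the oracle), 'self_label', or 'skip'.
    """
    confidence = max(proba)
    within_budget = spent < budget * max(seen, 1)
    if confidence < query_below and within_budget:
        return "query"  # uncertain instance: spend budget on a true label
    if confidence >= self_label_above:
        return "self_label"  # confident instance: trust the model's own label
    return "skip"

def run_stream(stream, model, oracle, budget=0.1):
    """Process a stream with a model exposing predict_proba(x) and learn(x, y)."""
    spent = 0
    actions = []
    for seen, x in enumerate(stream, start=1):
        proba = model.predict_proba(x)
        action = label_decision(proba, spent, seen, budget)
        if action == "query":
            model.learn(x, oracle(x))
            spent += 1
        elif action == "self_label":
            predicted = max(range(len(proba)), key=proba.__getitem__)
            model.learn(x, predicted)
        actions.append(action)
    return actions, spent
```

Because the rule only needs predicted probabilities and an update method, it can wrap different learners, which mirrors the wrapper design mentioned in the abstract.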

Climate Policy Tracker: Pipeline for automated analysis of public climate policies

Artur Żółkowski , Mateusz Krzyziński , Piotr Wilczyński , Stanisław Giziński , Emilia Wiśnios , Bartosz Pieliński , Julian Sienkiewicz , Przemysław Biecek

Category: Natural Language Processing

2022-11-10

The number of standardized climate policy documents and their publication frequency are increasing significantly. The documents are long and tedious to analyze manually, especially for policy experts, lawmakers, and citizens who lack access to data analytics tools or the domain expertise to use them. Potential consequences of this situation include reduced citizen governance and involvement in climate policies and an overall surge in analytics costs, which makes the documents less accessible to the public. In this work, we use a Latent Dirichlet Allocation-based pipeline for the automatic summarization and analysis of 10 years of national energy and climate plans (NECPs) for the period from 2021 to 2030, established by 27 Member States of the European Union. We focus on analyzing policy framing, the language used to describe specific issues, to detect essential nuances in the way governments frame their climate policies and achieve climate goals. The method leverages topic modeling and clustering for the comparative analysis of policy documents across different countries. It allows for easier integration into potential user-friendly applications for the development of theories and processes of climate policy. This would further lead to better citizen governance and engagement over climate policies and public policy research.

Quantification of entanglement with Siamese convolutional neural networks

Jarosław Pawłowski , Mateusz Krawczyk

Category: Artificial Intelligence

2022-10-13

Quantum entanglement is a fundamental property commonly used in various quantum information protocols and algorithms. Nonetheless, the problem of quantifying entanglement has still not reached a general solution for systems larger than two qubits. In this paper, we investigate the possibility of detecting entanglement with a supervised machine learning method, namely deep convolutional neural networks. We build a model consisting of convolutional layers, which is able to recognize and predict the presence of entanglement for any bipartition of a given multi-qubit system. We demonstrate that training our model on synthetically generated datasets of random density matrices, which either include or exclude challenging positive-under-partial-transposition entangled states (PPTES), leads to different model accuracy and a different ability to detect such states. Moreover, it is shown that enforcing entanglement-preserving symmetry operations (local operations on qubits or permutations of qubits) by using a triple Siamese network can significantly increase the model performance and its ability to generalize to types of states not seen during the training stage. We perform numerical calculations for 3-, 4-, and 5-qubit systems, therefore proving the scalability of the proposed approach.
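
The partial-transposition test that underlies the notion of PPTES can be sketched with NumPy. This is an illustrative labeling helper, not the authors' data generator. The Ginibre-style sampling of random density matrices and the function names are assumptions. A negative eigenvalue of the partial transpose certifies entanglement across the chosen bipartition, but a non-negative spectrum implies separability only for 2x2 and 2x3 systems, which is why PPT entangled states are hard to detect.

```python
import numpy as np

def random_density_matrix(dim, rng):
    """Random mixed state: G G^dagger normalized to unit trace (assumed sampler)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def partial_transpose(rho, dim_a, dim_b):
    """Transpose subsystem B of a state on the tensor product of A and B."""
    r = np.asarray(rho).reshape(dim_a, dim_b, dim_a, dim_b)
    # Swap the row and column indices that belong to subsystem B.
    return r.transpose(0, 3, 2, 1).reshape(dim_a * dim_b, dim_a * dim_b)

def min_pt_eigenvalue(rho, dim_a, dim_b):
    """Smallest eigenvalue of the partial transpose (negative => entangled)."""
    return np.linalg.eigvalsh(partial_transpose(rho, dim_a, dim_b)).min()
```

For a 3-qubit state, the bipartition of the first qubit against the other two corresponds to `dim_a=2, dim_b=4`. Other bipartitions would first require permuting the qubits so that each subsystem is contiguous.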

Active Few-Shot Classification: a New Paradigm for Data-Scarce Learning Settings

Aymane Abdali , Vincent Gripon , Lucas Drumetz , Bartosz Boguslawski

Category: Machine Learning

2022-09-23

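We consider a novel formulation of the problem of Active Few-Shot Classification (AFSC), in which the objective is to classify a small, initially unlabeled dataset under a very restricted labeling budget. This problem can be seen as a rival paradigm to classical Transductive Few-Shot Classification (TFSC), since both approaches are applicable under similar conditions. We first propose a methodology that combines statistical inference with an original two-tier active learning strategy that fits well into this framework. We then adapt several standard vision benchmarks from the TFSC field. Our experiments show that the potential benefits of AFSC can be substantial, with gains in average weighted accuracy of up to 10% over state-of-the-art TFSC methods for the same labeling budget. We believe that this new paradigm could lead to new developments and standards in data-scarce learning settings.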

Deep learning automates bidimensional and volumetric tumor burden measurement from MRI in pre- and post-operative glioblastoma patients

Jakub Nalepa , Krzysztof Kotowski , Bartosz Machura , Szymon Adamski , Oskar Bozek , Bartosz Eksner , Bartosz Kokoszka , Tomasz Pekala , Mateusz Radom , Marek Strzelczak

Category: Computer Vision

2022-09-03

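Tumor burden assessment by magnetic resonance imaging (MRI) is central to evaluating treatment response in glioblastoma. Because of the high heterogeneity and complexity of the disease, this assessment is complex to perform and is associated with high variability. In this work, we tackle this issue and propose a deep learning pipeline for the fully automated end-to-end analysis of glioblastoma patients. In the first step, our approach simultaneously identifies tumor sub-regions, including the tumor, the peritumoral region, and the surgical cavity; it then calculates the volumetric and bidimensional measurements that follow the current Response Assessment in Neuro-Oncology (RANO) criteria. In addition, we introduce a rigorous manual annotation process, which human experts followed to delineate the tumor sub-regions and to record their confidence in the segmentation; these confidences are later used when training the deep learning models. The results of our extensive experimental study, performed on 760 pre-operative and 504 post-operative glioma patients obtained from a public database (acquired at 19 sites, 2021-2020) and from a clinical treatment trial (47 and 69 sites for pre-/post-operative patients, 2009-2011) and backed up with thorough quantitative, qualitative, and statistical analysis, show that our pipeline accurately segments pre- and post-operative MRI in a fraction of the manual delineation time (up to 20 times faster than humans). The bidimensional and volumetric measurements agree strongly with those of expert radiologists, and we show that RANO measurements are not always sufficient to quantify tumor burden.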
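
The two kinds of measurement mentioned above can be illustrated on a binary segmentation mask. This is a geometric simplification for illustration only, not the paper's pipeline or a clinical RANO implementation. The bidimensional value is approximated as the longest in-plane diameter multiplied by the extent of the mask along the perpendicular direction, distances are measured between voxel centers, and the function names are hypothetical.

```python
import numpy as np

def tumor_volume_ml(mask, spacing_mm):
    """Volume of a 3D binary mask in millilitres.

    mask: 3D array, nonzero = tumor voxel.
    spacing_mm: voxel size along each axis in millimetres.
    """
    voxel_mm3 = float(np.prod(spacing_mm))
    return np.count_nonzero(mask) * voxel_mm3 / 1000.0  # 1 ml = 1000 mm^3

def bidimensional_product_mm2(slice_mask, spacing_mm):
    """Approximate product of two perpendicular diameters on one 2D slice."""
    pts = np.argwhere(slice_mask) * np.asarray(spacing_mm, dtype=float)
    if len(pts) < 2:
        return 0.0
    # Longest diameter: largest pairwise distance (brute force, small masks only).
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    major = dist[i, j]
    axis = (pts[j] - pts[i]) / major
    perp = np.array([-axis[1], axis[0]])
    proj = pts @ perp  # coordinates along the perpendicular direction
    minor = proj.max() - proj.min()
    return float(major * minor)
```

The pairwise-distance matrix grows quadratically with the number of foreground pixels, so a practical version would restrict the search to boundary pixels.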

Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis

Anna Wróblewska , Bartosz Pieliński , Karolina Seweryn , Karol Saputa , Aleksandra Wichrowska , Sylwia Sysko-Romańczuk , Hanna Schreiber

Category: Natural Language Processing

2022-09-02

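This paper presents research on a prototype developed to serve the quantitative study of public policy design. This sub-discipline of political science focuses on identifying actors, the relations between them, and the tools at their disposal in health, environmental, economic, and other policies. Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyze the inter-relations between key entities. The system was tested on the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage, a legal document that regulates essential aspects of international relations concerning the protection of cultural heritage.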

ProPaLL: Probabilistic Partial Label Learning

Łukasz Struski , Jacek Tabor , Bartosz Zieliński

Category: Machine Learning | Artificial Intelligence

2022-08-21

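Partial label learning is a type of weakly supervised learning in which each training instance corresponds to a set of candidate labels, only one of which is correct. In this paper, we introduce a novel probabilistic approach to this problem that has at least three advantages over existing approaches: it simplifies the training process, improves performance, and can be applied to any deep architecture. Experiments conducted on artificial and real-world datasets indicate that ProPaLL outperforms existing methods.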
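
The partial-label setting can be made concrete with a generic candidate-set likelihood: the loss is the negative log of the total probability that the model assigns to each instance's candidate labels. This is a common baseline formulation used here for illustration. It is not claimed to be the ProPaLL objective, and the function name is hypothetical. The sketch assumes that every instance has at least one candidate label.

```python
import numpy as np

def candidate_set_nll(logits, candidate_mask):
    """Mean negative log-probability mass on each instance's candidate set.

    logits:         (N, K) unnormalized class scores.
    candidate_mask: (N, K) boolean, True where a label is a candidate.
    Assumes each row of candidate_mask contains at least one True.
    """
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(candidate_mask, dtype=bool)
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Log-sum-exp restricted to the candidate labels.
    masked = np.where(mask, log_p, -np.inf)
    m = masked.max(axis=1, keepdims=True)
    log_mass = m[:, 0] + np.log(np.exp(masked - m).sum(axis=1))
    return float(-log_mass.mean())
```

When every candidate set contains exactly one label, this reduces to the ordinary cross-entropy loss.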
